Conversation

@yushengsu-thu (Collaborator) commented Jan 7, 2026

Description

  1. [to-do] Refactor (move more features to megatron-bridge) and fix bugs
  2. Megatron backend
  3. Disk-based weight sync
  4. Update LoRA weights via tensor
    (Push LoRA weights to the SGLang rollout engine via tensor, which is faster than the previous disk-sync approach; see the sketch after this list.)
    Waiting for this SGLang PR to be merged: Update LoRA Weights via Tensor sgl-project/sglang#16226
  5. SGLang patch required by this PR, in /python/sglang/srt/models/qwen2.py at line 611:
# Avoid substring match: skip if name already contains the fused param_name,
# e.g., skip the "v_proj" match when name already contains "qkv_proj"
if param_name in name:
    continue
  6. [to-do] Fix the weight sync problem in LoRA (the Megatron engine is currently not offloaded); with --offload-rollout-level kv_cache, the weight part is not handled correctly.
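
To make the difference between items 3 and 4 above concrete, here is a minimal, self-contained sketch of the two sync paths. It is not this PR's implementation: RolloutEngine and both of its method names are hypothetical stand-ins, and the real tensor-based interface is the one added in sgl-project/sglang#16226.

import os
import tempfile

import torch


class RolloutEngine:
    """Stand-in for the SGLang rollout engine; the method names are hypothetical."""

    def __init__(self) -> None:
        self.lora_weights = {}

    def load_lora_from_disk(self, path: str) -> None:
        # Disk sync: the trainer serializes the adapter and the engine reloads it.
        self.lora_weights = torch.load(path)

    def update_lora_from_tensor(self, named_tensors: dict) -> None:
        # Tensor sync: weights are handed over in memory (NCCL/CUDA IPC in a
        # real multi-process setup), skipping the filesystem round-trip.
        self.lora_weights = {name: t.clone() for name, t in named_tensors.items()}


# Toy LoRA adapter for one projection (rank 8; 896 is Qwen2.5-0.5B's hidden size).
lora = {
    "q_proj.lora_A": torch.randn(8, 896),
    "q_proj.lora_B": torch.randn(896, 8),
}
engine = RolloutEngine()

# Path 1: disk sync -- one torch.save plus one torch.load per update.
with tempfile.TemporaryDirectory() as tmp:
    ckpt = os.path.join(tmp, "lora_adapter.pt")
    torch.save(lora, ckpt)
    engine.load_lora_from_disk(ckpt)

# Path 2: tensor sync -- the adapter never touches the filesystem.
engine.update_lora_from_tensor(lora)

Both paths leave the engine with the same adapter; the tensor path just replaces a save/load pair plus filesystem I/O with an in-memory hand-off, which is why item 4 describes it as faster.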

Prerequisites

  • Docker
docker run --rm -it \
  --gpus all \
  -p 8264:8264 \
  --cap-add SYS_PTRACE \
  --security-opt seccomp=unconfined \
  --privileged \
  -v /.ssh/:/.ssh/ \
  -v /data:/data \
  --shm-size 128G \
  --name miles_yusheng \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -w $PWD \
  radixark/miles:latest 
  • Megatron-Bridge
git clone --branch merged-megatron-0.16.0rc0 --single-branch https://github.com/yushengsu-thu/Megatron-Bridge.git
cd Megatron-Bridge
pip install -e . --no-deps --no-build-isolation
pip install megatron-energon --no-deps
pip install multi-storage-client --no-deps

Testing

# Dataset and model download
huggingface-cli download --repo-type dataset zhuzilin/gsm8k --local-dir /root/gsm8k
huggingface-cli download Qwen/Qwen2.5-0.5B-Instruct --local-dir /root/Qwen2.5-0.5B-Instruct

# Codebase
git clone --branch miles-lora-megatron --single-branch https://github.com/yushengsu-thu/miles.git 
cd miles
source scripts/models/qwen2.5-0.5B.sh
PYTHONPATH=/root/Megatron-LM/ python \
   tools/convert_hf_to_torch_dist.py \
   "${MODEL_ARGS[@]}" \
   --hf-checkpoint /root/Qwen2.5-0.5B-Instruct \
   --save /root/Qwen2.5-0.5B-Instruct_torch_dist/

# Run script:
bash examples/reproducibility/run-qwen2.5-0.5B-gsm8k-lora.sh

Related Issues and PRs

LoRA FSDP backend PR: #377
SGLang sync from tensor: sgl-project/sglang#16226

Code Style Compliance

  • Performance: Minimized synchronization calls (.item(), .cpu(), .tolist()) in inference paths
  • Architecture: No duplicate code > 5 lines; files < 2,000 lines
  • Function Purity: Avoided in-place modification of input arguments (unless explicitly documented for memory optimization)
  • Pythonic: Lean constructors, minimal dynamic attributes, proper type hints on public APIs
  • Testing: Provided a test script that reviewers can copy & paste to run immediately

Copilot AI review requested due to automatic review settings January 7, 2026 20:21
@gemini-code-assist (Contributor) commented

Summary of Changes

Hello @yushengsu-thu, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request is a work in progress that integrates a Megatron backend for LoRA training into the miles framework. It aims to make LoRA weight updates more efficient by using tensor-based synchronization with SGLang instead of disk-based methods. The changes provide foundational support for scalable LoRA fine-tuning with Megatron-LM, demonstrated through a new example script.

Highlights

  • Megatron LoRA Backend Integration: This pull request introduces a Megatron backend for LoRA (Low-Rank Adaptation) training, enabling the use of Megatron-LM for fine-tuning models with LoRA within the miles framework.
  • Tensor-based LoRA Weight Updates: The implementation supports updating LoRA weights via tensors, which is noted as a faster and more efficient method compared to previous disk synchronization approaches. This feature is dependent on an external SGLang pull request.
  • New Example Script for LoRA Training: A new example script (run-qwen2.5-0.5B-gsm8k-lora.sh) has been added, demonstrating how to perform LoRA training for the Qwen2.5-0.5B model on the GSM8k dataset using the new Megatron backend and SGLang.

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces a new reproducibility script for running a LoRA fine-tuning experiment on the Qwen2.5-0.5B model with the GSM8K dataset using the Megatron backend. The script is well-structured, using bash arrays to organize command-line arguments for clarity.

My review focuses on improving the robustness and correctness of the script. I've provided two main suggestions:

  1. Refining the initial cleanup logic to be more robust and safer, by preferring graceful shutdowns and highlighting the risk of using a broad pkill on all python processes.
  2. Correcting the environment variable used to disable Python's output buffering from the non-standard PYTHONBUFFERED to the correct PYTHONUNBUFFERED.

These changes should make the script more reliable and adhere to better practices.

Comment on lines +4 to +11
pkill -9 sglang
sleep 3
ray stop --force
pkill -9 ray
pkill -9 python
sleep 3
pkill -9 ray
pkill -9 python
Severity: medium

The cleanup logic at the beginning of the script is quite aggressive and could be improved for safety and robustness.

  • Graceful Shutdown: Using pkill -9 (SIGKILL) immediately prevents processes from cleaning up properly. It's better to first try a graceful shutdown with pkill (SIGTERM).
  • Broad pkill: pkill -9 python is very broad and could terminate unrelated Python processes, which is risky outside of a completely isolated container.
  • Redundancy: The repeated pkill commands suggest the cleanup might be fragile. A single, more robust cleanup sequence is preferable.
Suggested change
pkill -9 sglang
sleep 3
ray stop --force
pkill -9 ray
pkill -9 python
sleep 3
pkill -9 ray
pkill -9 python
pkill sglang
ray stop --force
sleep 5 # Wait for processes to terminate gracefully
# Force kill any remaining processes.
# Note: `pkill -9 python` is broad and can be risky.
pkill -9 sglang
pkill -9 ray
pkill -9 python

set -ex

# will prevent ray from buffering stdout/stderr
export PYTHONBUFFERED=16
Severity: medium

The environment variable PYTHONBUFFERED is not standard. The correct variable to disable output buffering for Python is PYTHONUNBUFFERED. Setting it to any non-empty string (conventionally 1) will have the desired effect of making stdout/stderr unbuffered.

Suggested change
export PYTHONBUFFERED=16
export PYTHONUNBUFFERED=1

Copilot AI left a comment

Pull request overview

This PR adds Megatron backend support for LoRA training to the Miles project. The implementation includes disk-based weight synchronization and a tensor-based weight update mechanism for the SGLang rollout engine; the feature is still a work in progress.

Key Changes:

  • Added Megatron backend integration for LoRA training
  • Implemented disk sync weight functionality
  • Added tensor-based LoRA weight update mechanism (pending upstream SGLang PR)


--rollout-shuffle
--rm-type math
# --num-rollout 100
--num-rollout 10 # onyl train 10 stesp
Copilot AI commented Jan 7, 2026

Spelling errors: "onyl" should be "only" and "stesp" should be "steps". The comment should read "# only train 10 steps".

Suggested change
--num-rollout 10 # onyl train 10 stesp
--num-rollout 10 # only train 10 steps

CKPT_ARGS=(
--hf-checkpoint /root/Qwen2.5-0.5B-Instruct/
--ref-load /root/Qwen2.5-0.5B-Instruct_torch_dist/
# Uncomment to save checkpoints (required for LoRA)
Copilot AI commented Jan 7, 2026

The comment states "Uncomment to save checkpoints (required for LoRA)" but the checkpoint saving arguments on lines 25-26 are already active (not commented out). This creates confusion about whether checkpoints are being saved. Either update the comment to reflect that checkpoints are enabled, or comment out lines 25-26 if they should be optional.

Suggested change
# Uncomment to save checkpoints (required for LoRA)
# Save checkpoints (required for LoRA). Adjust path/interval as needed.

--target-modules "q_proj,k_proj,v_proj,o_proj"
# --target-modules "q_proj,k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj"
# --lora-sync-from-tensor # Use tensor-based sync (more efficient)
# Uncomment to share base model between actor and ref (saves memory)
Copilot AI commented Jan 7, 2026

The comment states "Uncomment to share base model between actor and ref (saves memory)" but the --share-ref-base-model argument on line 41 is already active (not commented out). This creates confusion. Either update the comment to reflect that sharing is enabled, or comment out line 41 if it should be optional.

Suggested change
# Uncomment to share base model between actor and ref (saves memory)
# Share base model between actor and ref (saves memory)
